Speeding Up Exact Motif Discovery by Bounding the Expected Clump Size
Abstract
The overlapping structure of complex patterns, such as IUPAC motifs, significantly affects their statistical properties and should be taken into account in motif discovery algorithms. The contribution of this paper is twofold. On the one hand, we give surprisingly simple formulas for the expected size and weight of motif clumps (maximal overlapping sets of motif matches in a text). In contrast to previous results, we show that these expected values can be computed without matrix inversions. On the other hand, we show how these results can be algorithmically exploited to improve an exact motif discovery algorithm. First, the algorithm can be efficiently generalized to arbitrary finite-memory text models, whereas it was previously limited to i.i.d. texts. Second, we achieve a speed-up of up to a factor of 135. Our open-source (GPL) implementation is available at http://www.rahmannlab.de/software .
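To make the notion of a clump concrete, the following minimal sketch finds all matches of an IUPAC motif in a DNA text and groups them into clumps, that is, maximal runs of matches in which each match overlaps the previous one. It does not reproduce the paper's formulas or its discovery algorithm. The function names `match_positions` and `clumps` are hypothetical, and taking the size of a clump to be the number of matches it contains is an assumption made for illustration.

```python
# Standard IUPAC nucleotide codes mapped to the bases they match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_positions(motif, text):
    """Return the start positions of all (possibly overlapping) matches."""
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(text[i + j] in IUPAC[motif[j]] for j in range(m))
    ]


def clumps(motif, text):
    """Group match positions into clumps (hypothetical helper).

    Two consecutive matches belong to the same clump when they share
    at least one text position, i.e. their starts differ by less than
    the motif length.
    """
    m = len(motif)
    groups = []
    for pos in match_positions(motif, text):
        if groups and pos - groups[-1][-1] < m:
            groups[-1].append(pos)  # overlaps the previous match
        else:
            groups.append([pos])    # starts a new clump
    return groups


if __name__ == "__main__":
    found = clumps("ANA", "ACATAGGGAAA")
    sizes = [len(c) for c in found]
    print(found, sizes, sum(sizes) / len(sizes))
```

For the motif `ANA` and the text `ACATAGGGAAA`, the matches start at positions 0, 2, and 8. The first two overlap and form one clump of size 2, while the third forms a clump of size 1. Averaging such sizes over many random texts would give an empirical estimate of the expected clump size that the paper computes analytically.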
Similar resources
Algorithms and statistical methods for exact motif discovery
The motif discovery problem consists of uncovering exceptional patterns (called motifs) in sets of sequences. It arises in molecular biology when searching for yet unknown functional sites in DNA sequences. In this thesis, we develop a motif discovery algorithm that (1) is exact, that means it returns a motif with optimal score, (2) can use the statistical significance with respect to complex b...
Efficient exact motif discovery
MOTIVATION: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-d...
Development of an Efficient Hybrid Method for Motif Discovery in DNA Sequences
This work presents a hybrid method for motif discovery in DNA sequences. The proposed method, called SPSO-Lk, borrows the concept of Chebyshev polynomials and uses stochastic local search to improve the performance of the basic PSO algorithm as a motif finder. The Chebyshev polynomial concept encourages us to use a linear combination of previously discovered velocities beyond that proposed b...
EDAM: An Efficient Clique Discovery Algorithm with Frequency Transformation for Finding Motifs
Finding motifs in DNA sequences plays an important role in deciphering transcriptional regulatory mechanisms and in drug target identification. In this paper, we propose an efficient algorithm, EDAM, for finding motifs based on frequency transformation and Minimum Bounding Rectangle (MBR) techniques. It works in three phases: frequency transformation, MBR-clique searching, and motif discovery. In f...
Speeding up MAP with Column Generation and Block Regularization
In this paper, we show how the connections between max-product message passing and linear programming relaxations for MAP allow for a more efficient exact algorithm than standard dynamic programming. Our proposed algorithm uses column generation to pass messages only on a small subset of the possible assignments to each variable, while guaranteeing to find the exact solution. This algorithm is ...